Multilingual Resources for Entity Extraction
نویسندگان
چکیده
Progress in human language technology requires increasing amounts of data and annotation in a growing variety of languages. Research in Named Entity extraction is no exception. Linguistic Data Consortium is creating annotated corpora to support information extraction in English, Chinese, Arabic, and other languages for a variety of US Governmentsponsored programs. This paper covers the scope of annotation and research tasks within these programs, describes some of the challenges of multilingual corpus development for entity extraction, and concludes with a description of the corpora developed to support this research.
منابع مشابه
Shared Resources for Multilingual Information Extraction and Challenges in Named Entity Annotation
Progress in natural language processing requires increasing amounts of data and annotation in a growing variety of languages, and research in named entity extraction is no exception. While the value of richlyannotated, large-scale multilingual corpora is undeniable, costs for producing such data are high, underscoring the value of shared resources. As part of the US Governmentsponsored Automati...
متن کاملMultilingual and Cross-lingual Timeline Extraction
In this paper we present an approach to extract ordered timelines of events, their participants, locations and times from a set of multilingual and crosslingual data sources. Based on the assumption that event-related information can be recovered from different documents written in different languages, we extend the Cross-document Event Ordering task presented at SemEval 2015 by specifying two ...
متن کاملExpanding a multilingual media monitoring and information extraction tool to a new language: Swahili
The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news in currently fifty languages and that extract named entities and quotations (reported speech) from twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM is designed in an entirely modular way, allowing plugg...
متن کاملTIDES Language Resources: A Resource Map for Translingual Information Access
Continuing improvements in human language algorithms, coupled with improvements in digital storage and processing, inspire growing confidence in multilingual information access systems. Systems exist to transcribe broadcast news, segment broadcasts into individual stories and sort them by topic. These technologies, useful in isolation, are now being combined to produce intelligent multilingual ...
متن کاملCreation of Reusable Components and Language Resources for Named Entity Recognition in Russian
This paper describes the development of the RussIE system in which we experimented with the creation of reusable processing components and language resources for a Russian Information Extraction system. The work was done as part of a multilingual project to adapt existing tools and resources for HLT to new domains and languages. The system was developed within the GATE architecture for language...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003